import warnings
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
warnings.filterwarnings("ignore")
df = pd.read_csv('hotel_bookings.csv')
df.head(5)
df.info()
df.isnull().sum().sort_values(ascending=False)
df.describe()
df.hist(figsize=(20,14))
plt.show()
nan_replacements = {"country": "Unknown", "agent": 0, "company": 0}
df = df.fillna(nan_replacements)
df["meal"].replace("Undefined", "SC", inplace=True)
df = df.fillna(0.0)  # any remaining NaNs (e.g. in children) become 0.0
# Some rows contain entries with 0 adults, 0 children and 0 babies.
zero_guests = df[df["adults"] + df["children"] + df["babies"] == 0].index
df.drop(index=zero_guests, inplace=True)
df.drop_duplicates(inplace=True)
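The cleaning steps above can be sanity-checked on a tiny synthetic frame (the column names mirror the dataset, but the values here are made up purely for illustration):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    "country": ["PRT", np.nan],
    "agent": [9.0, np.nan],
    "company": [np.nan, 40.0],
    "adults": [2, 0],
    "children": [0.0, 0.0],
    "babies": [0, 0],
})

# same fill strategy as above
toy = toy.fillna({"country": "Unknown", "agent": 0, "company": 0})

# drop rows where the whole party sums to zero guests, using index labels directly
zero_guests = toy[toy["adults"] + toy["children"] + toy["babies"] == 0].index
toy = toy.drop(index=zero_guests)

assert toy.isnull().sum().sum() == 0  # no NaNs remain
assert len(toy) == 1                  # the zero-guest row is gone
```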
plt.figure(figsize=(16, 16))
# numeric_only avoids errors on newer pandas versions, where corr() no longer silently drops object columns
ax = sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()
target_correlation = df.corr(numeric_only=True)['is_canceled'].sort_values(ascending=False)  # .abs()
target_correlation
df = df.drop(['reservation_status_date'],axis=1)
num_features = ['lead_time','arrival_date_year','arrival_date_week_number','stays_in_weekend_nights',
'stays_in_week_nights','adults','children','babies','is_repeated_guest','previous_cancellations',
'previous_bookings_not_canceled','booking_changes','days_in_waiting_list','adr','required_car_parking_spaces',
'total_of_special_requests']
cat_features = ['hotel', 'arrival_date_month','meal','country','market_segment',
'distribution_channel','reserved_room_type','assigned_room_type','deposit_type','agent',
'company','customer_type','reservation_status']
features = num_features + cat_features
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
cat_data = df[cat_features].copy()  # explicit copy avoids SettingWithCopyWarning
for col in cat_data:
    cat_data[col] = le.fit_transform(cat_data[col])
cat_data
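One caveat with the loop above: it reuses a single LabelEncoder, so after the loop only the last column's mapping is recoverable. If decoding the integers back to labels ever matters, a sketch of one alternative is to keep one fitted encoder per column (toy data here, not the actual dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

cat_toy = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel"],
    "meal": ["BB", "SC", "BB"],
})

encoders = {}
encoded = cat_toy.copy()
for col in encoded.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(encoded[col])
    encoders[col] = enc  # keep the fitted encoder so the mapping stays recoverable

# decode one column back to the original labels
decoded = encoders["hotel"].inverse_transform(encoded["hotel"])
print(list(decoded))  # ['Resort Hotel', 'City Hotel', 'City Hotel']
```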
num_data = df[num_features].copy()
num_data['children'] = num_data['children'].astype('int')
X = pd.concat([cat_data, num_data], axis = 1)
y = df['is_canceled']
print(X.shape,y.shape)
X.head()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)
rf_model_enh = RandomForestClassifier(n_estimators=160,
max_features=0.4,
min_samples_split=2,
n_jobs=-1,
random_state=0)
rf_model_enh.fit(X_train, y_train)
y_hat = rf_model_enh.predict(X_test)
rf_model_enh.score(X_test, y_test)
print((y_hat != y_test.values).sum())  # number of misclassified test samples
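Beyond raw accuracy, a confusion matrix and per-class metrics give a fuller picture of where the errors fall. A minimal, self-contained sketch with made-up labels (not the model's actual predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])

cm = confusion_matrix(y_true, y_pred)
print(cm)  # rows: true class, columns: predicted class
print(classification_report(y_true, y_pred, target_names=["Not Canceled", "Canceled"]))
```

With the real model, passing `y_test` and `y_hat` in place of the toy arrays works the same way.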
Wow, what an amazing model! :)
I've chosen the observation below for the prediction and explanation:
i1 = 5252
X_test.iloc[[i1]]
print("Prediction:", rf_model_enh.predict(X_test.iloc[[i1]]))
print('True value:', y_test.iloc[i1])
As expected, the prediction was correct.
Let us create an explainer object: since we are dealing with a classification task, the mode is set to classification. The class names I chose are Not Canceled for 0 and Canceled for 1.
from lime.lime_tabular import LimeTabularExplainer
explainer = LimeTabularExplainer(X_train.values,
mode='classification',
feature_names=X_train.columns,
class_names=['Not Canceled', 'Is Canceled'],
verbose=True,
random_state=21)
Now it's time to make the first explanation. First, though, we need to wrap the model's prediction function, since the LIME explainer expects class probabilities rather than hard labels. I limited the number of features to 15 after a few runs with different settings and a look at their results.
predict_fn_rf = lambda x: rf_model_enh.predict_proba(x).astype(float)
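LIME's tabular explainer expects a function that returns class probabilities of shape (n_samples, n_classes). A quick self-contained check of that contract, using a toy forest rather than the model above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 4))
y_toy = (X_toy[:, 0] > 0).astype(int)  # synthetic binary target

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)
predict_fn = lambda x: clf.predict_proba(x).astype(float)

probs = predict_fn(X_toy[:3])
assert probs.shape == (3, 2)                 # one row per sample, one column per class
assert np.allclose(probs.sum(axis=1), 1.0)   # each row is a probability distribution
```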
explanation = explainer.explain_instance(X_test.iloc[[i1]].values[0], predict_fn_rf, num_features=15)
explanation.show_in_notebook()
We observe that most variables have little or no impact on the target. On the left, we see the prediction probabilities; in the middle, the weights, i.e. the scale of influence of each variable, in descending order; on the right, whether each variable pushes towards 1 (orange) or 0 (blue). The reservation_status variable has the strongest negative influence, deposit_type and previous_cancellations have a slight negative impact, and required_car_parking_spaces has some positive impact.
plt.rcParams["figure.figsize"] = (16,35)
with plt.style.context('seaborn'):
explanation.as_pyplot_figure()
On the plot we see the same results, but it's easier to read.
I chose two additional observations to compare their explanations to the first one.
i2 = 2525
X_test.iloc[[i2]]
print("Prediction:", rf_model_enh.predict(X_test.iloc[[i2]]))
print('True value:', y_test.iloc[i2])
explanation2 = explainer.explain_instance(X_test.iloc[[i2]].values[0], predict_fn_rf, num_features=15)
explanation2.show_in_notebook()
Here we can notice that required_car_parking_spaces has a negative impact, and the weights of deposit_type and previous_cancellations have changed. days_in_waiting_list also carries some weight, while in the first example it did not.
with plt.style.context('seaborn'):
explanation2.as_pyplot_figure()
i3 = 5225
X_test.iloc[[i3]]
print("Prediction:", rf_model_enh.predict(X_test.iloc[[i3]]))
print('True value:', y_test.iloc[i3])
In this case, the model predicts that the booking is going to be canceled, so it is interesting to take a look at the explanation of such a prediction.
explanation3 = explainer.explain_instance(X_test.iloc[[i3]].values[0], predict_fn_rf, num_features=15)
explanation3.show_in_notebook()
Again, nothing comes close to the impact of reservation_status. Here it has the value 0, so the impact is positive. We can also notice that required_car_parking_spaces has a positive impact. Interestingly, in this case the babies variable has more influence than deposit_type and previous_cancellations. Moreover, many more variables have some impact on the target (non-zero weights in the LIME decomposition).
with plt.style.context('seaborn'):
explanation3.as_pyplot_figure()
As expected, reservation_status is the most important variable, which is easy to understand even without any complex model. The interesting part of the results is how the number of parking spaces affects the prediction; to me, this is definitely not evident. I believe previous cancellations and days spent on the waiting list are also factors hotels would take into account when they receive a booking request.
When it comes to explanation stability, we see that some variables can have a different impact on the prediction even with the same values (previous cancellations and days on the waiting list in cases 1 and 2); however, I think this might be due to the change in the required_car_parking_spaces value. There are no significant changes in those weights simply because of the dominance of reservation_status.
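A cheap way to probe stability on the model side (separate from LIME, and only a sketch on synthetic data) is to refit the forest with different random seeds and compare the resulting feature importances:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)  # signal on features 0 and 1

importances = []
for seed in (0, 1, 2):
    clf = RandomForestClassifier(n_estimators=50, random_state=seed).fit(X_toy, y_toy)
    importances.append(clf.feature_importances_)

# if the signal is stable, importances should roughly agree across seeds
spread = np.ptp(importances, axis=0)  # per-feature max-minus-min across the three fits
print(spread.max())
```

A similar idea applies to LIME itself: calling explain_instance several times with different random_state values and comparing the returned weights gives a rough sense of how stable an individual explanation is.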
I am also really surprised by how well the model predicted the observations this time; the results are almost suspicious to me, so I might take a closer look at the process later.
Homework 2 was really interesting and I definitely learned a lot in the process, so thank you and have a great day :)